FIGURE 2.6
Attention-distance comparison for the full-precision DeiT-Small, the fully quantized DeiT-Small baseline, and Q-ViT on the same input. Q-ViT behaves similarly to the full-precision model, while the baseline suffers from indistinguishable attention distances caused by information degradation.
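Attention distance here denotes the average spatial distance between a query patch and the patches it attends to, weighted by the attention probabilities. The following is a minimal sketch of how it can be measured, assuming a square patch grid and ignoring the class token; `attention_distance` is a hypothetical helper, not the authors' code:

```python
import torch

def attention_distance(attn, grid):
    """Mean attention-weighted spatial distance for one head.
    attn: (N, N) attention probabilities over N = grid * grid patch tokens.
    """
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2) coords
    dist = torch.cdist(pos, pos)          # pairwise patch distances, (N, N)
    return (attn * dist).sum(-1).mean()   # expected distance, averaged over queries

attn = torch.softmax(torch.randn(196, 196), dim=-1)  # stand-in for one DeiT head
print(attention_distance(attn, grid=14))
```

Degenerate attention maps collapse this statistic to nearly the same value across heads, which is the failure mode the baseline exhibits in Fig. 2.6.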
full-precision counterparts as much as possible, thus maximizing the mutual information between the quantized and full-precision representations [195]. As shown in [171], for the Gaussian distribution, the quantizers with maximum output entropy (MOE) and minimum average error (MAE) are approximately the same within a multiplicative constant. Therefore, minimizing the error between the full-precision and quantized values is equivalent to maximizing the information entropy of the quantized values. Thus, when a deterministic quantization function is applied to the quantized ViT, this objective is equivalent to maximizing the information entropy H(Q_x) of the quantized representation Q_x [171], as given in Eq. (2.16).
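To make this equivalence concrete, here is a minimal sketch (not the book's code) assuming a uniform symmetric quantizer applied to synthetic Gaussian data; `quantize` and `entropy` are hypothetical helpers. Sweeping the quantizer scale shows that the scale minimizing the mean-squared quantization error lies close to the one maximizing the empirical entropy H(Q_x):

```python
import torch

def quantize(x, scale, bits=4):
    """Uniform symmetric quantizer: scale, round to nearest, clip to the b-bit range."""
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax)  # integer levels

def entropy(q):
    """Empirical information entropy H(Q_x) of the quantized levels, in bits."""
    _, counts = torch.unique(q, return_counts=True)
    p = counts.float() / q.numel()
    return -(p * p.log2()).sum()

x = torch.randn(100_000)  # Gaussian input, matching the assumption in the text
for scale in (0.05, 0.1, 0.2, 0.4, 0.8):
    q = quantize(x, scale)
    mse = ((q * scale - x) ** 2).mean()  # error w.r.t. the full-precision values
    print(f"scale={scale:.2f}  MSE={mse.item():.4f}  H(Qx)={entropy(q).item():.3f} bits")
```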
[Figure 2.7: histograms of query values for Block.0.query, Block.3.query, and Block.6.query; panels (a) Full-Precision and (b) Q-ViT.]
FIGURE 2.7
Histograms of the query and key values q, k (shaded) along with the PDF curves of the Gaussian distribution N(μ, σ²) [195], for three selected layers in DeiT-T and 4-bit Q-ViT. Here μ and σ² are the statistical mean and variance of the values.
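The construction behind such plots can be sketched as follows (illustrative only; the synthetic `q` stands in for a layer's actual query values): histogram the values with `density=True` and overlay the PDF of N(μ, σ²), with μ and σ² taken as the sample mean and variance.

```python
import numpy as np
import matplotlib.pyplot as plt

q = 0.3 * np.random.randn(50_000) + 0.1  # stand-in for Block.0.query values
mu, sigma = q.mean(), q.std()            # statistical mean and std of the values

xs = np.linspace(q.min(), q.max(), 200)
pdf = np.exp(-((xs - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

plt.hist(q, bins=100, density=True, alpha=0.4, label="query values")
plt.plot(xs, pdf, label="N(mu, sigma^2)")
plt.title("Block.0.query")
plt.legend()
plt.show()
```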